Computer-implemented methods and systems for modeling and recognition of speech

ABSTRACT

In accordance with the present invention, computer implemented methods and systems are provided for representing and modeling the temporal structure of audio signals. In response to receiving a signal, a time-to-frequency domain transformation on at least a portion of the received signal to generate a frequency domain representation is performed. The time-to-frequency domain transformation converts the signal from a time domain representation to the frequency domain representation. A frequency domain linear prediction (FDLP) is performed on the frequency domain representation to estimate a temporal envelope of the frequency domain representation. Based on the temporal envelope, one or more speech features are generated.

CROSS REFERENCE TO RELATED APPLICATION

This application is a continuation of and claims priority under 35 U.S.C. § 120 to U.S. patent application Ser. No. 11/090,728, filed Mar. 25, 2005, and entitled “Computer-Implemented Methods and Systems for Modeling and Recognition of Speech,” which is a continuation of U.S. patent application Ser. No. 11/000,874, filed Dec. 1, 2004, which claims the benefit under 35 U.S.C. § 119(e) of U.S. Provisional Patent Application Nos. 60/525,947, filed Dec. 1, 2003, and 60/578,985, filed Jun. 10, 2004, which are hereby incorporated by reference herein in their entireties.

STATEMENT REGARDING FEDERALLY SPONSORED RESEARCH OR DEVELOPMENT

The government may have certain rights in the present invention pursuant to grants from the Effective, Affordable, Reusable Speech-to-Text (EARS-NA) program at the Defense Advanced Research Projects Agency (DARPA), Contract No. MDA972-02-1-0024.

FIELD OF THE INVENTION

The present invention generally relates to sound recognition. More particularly, the present invention relates to modeling audio signals for speech recognition, sound encoding and decoding, and artificial sound synthesis.

BACKGROUND OF THE INVENTION

In recent years, automatic speech recognition (ASR) systems have been employed in a wide variety of areas, such as, for example, telephone dialing, directory assistance, order entry, home banking, database inquiry, and dictation. For example, cellular telephones commonly employ ASR systems to simplify the user interface. Using ASR systems, many cellular telephones recognize and execute commands to initiate an outgoing phone call or answer an incoming phone call. For example, a cellular telephone having an ASR system may recognize a spoken name from a phone book or a contact list and automatically initiate a phone call to the phone number associated with the spoken name.

In an ASR system, a user speaks into a microphone (i.e., inputs a speech signal). The inputted analog signal is digitized and the blocks of digital data are then transformed from the time domain into the frequency domain using a digital signal processing (DSP) chip. Once the ASR system has digitized the signal and calculated certain parameters, the system compares the signal to a library of known phrases and finds the closest match.

To extract the features from the signal for comparison with data in the library, such ASR systems generally use short-term spectral features, such as mel-frequency cepstral (i.e., frequency-related) coefficients (MFCC). MFCCs are based on a Fast Fourier Transform (FFT), which converts the inputted signal from a time domain representation to a frequency domain representation. The MFCC representation is an example of an approach that further analyzes the FFT of the signal. The MFCC representation is generated by using a mathematical transformation called the cepstrum, which computes the inverse Fourier transform of the log-spectrum of the speech signal.

These ASR systems uniformly employ short-time spectral analysis, usually over windows of about 10 to 30 milliseconds, as the basis for acoustic representations. It should be noted, however, that the detailed time structure below this timescale is lost and the time structure above this level is weakly represented in the form of deltas. The temporal structure in sub-10 millisecond transient segments contains important cues for both the perception of natural sounds and the understanding of stop bursts in speech. The gross temporal distribution of acoustic energy in windows of up to 1 second is a successful domain for the recognition of complete phonemes and the description of their dynamics. Thus, while the spectral structures resulting from the spectral analysis convey important linguistic information, they are only a partial representation of speech signals.

Other feature extraction techniques, such as, for example, dynamic (delta) features and relative spectra processing technique (RASTA), have been adopted as post-processing techniques that operate on sequences of the short-term feature vectors. Such techniques provide a “locally-global” view in which features to be used in classification are based upon a speech segment of about one syllable's length.

Accordingly, it is desirable to provide systems and methods that overcome these and other deficiencies of the prior art.

SUMMARY OF THE INVENTION

In accordance with the present invention, computer implemented methods and systems are provided for representing and modeling the temporal structure of audio signals.

In accordance with some embodiments of the present invention, computer implemented methods and systems of extracting speech features from signals for use in performing automatic speech recognition are provided. In response to receiving a signal, a time-to-frequency domain transformation on at least a portion of the received signal to generate a frequency domain representation is performed. The time-to-frequency domain transformation converts the signal from a time domain representation to the frequency domain representation. A frequency domain linear prediction (FDLP) is performed on the frequency domain representation to estimate a temporal envelope of the frequency domain representation. Based on the temporal envelope, one or more speech features are generated.

In some embodiments, the time-to-frequency domain transformation is performed by applying a discrete cosine transform (DCT) or a discrete Fourier transform on the portion of the received signal.

In some embodiments, the frequency domain linear prediction may include selecting a temporal window to apply the linear prediction and automatically determining a pole rate to distribute poles for modeling the temporal envelope. The poles generally characterize the temporal peaks of the temporal envelope. The pole rate may be automatically determined to capture both gross variation and stop burst transients of the signal.

In some embodiments, an index of sharpness may be extracted from each of the poles. The index of sharpness $\hat{\rho}_{i}$ of the FDLP poles $\{\rho_{i}\}$ is defined as

$\hat{\rho}_{i} = \frac{1}{1 - |\rho_{i}|}.$

In some embodiments, the frequency domain linear prediction is performed by estimating the square of the Hilbert envelope of the signal or calculating the inverse Fourier transform of the magnitude-squared Fourier transform of a portion of the frequency domain representation raised to a given power. When the given power is 1, the autocorrelation of the single sided (positive frequency) spectrum is calculated. Alternatively, when the given power is not 1, the pseudoautocorrelation is calculated. The autocorrelation of the spectral coefficients may be used to predict the temporal envelope of the signal.

In accordance with some embodiments of the present invention, the frequency domain representation may be divided into a plurality of frequency bands. An FDLP polynomial may then be fitted to each of the plurality of frequency bands. Temporal envelopes may be extracted from each of the plurality of frequency bands using the fitted FDLP polynomial.

In some embodiments, the frequency domain representation may be divided by logarithmically splitting the frequency domain representation into the plurality of frequency bands.

In accordance with some embodiments of the present invention, computer implemented methods and systems of extracting speech features from signals are provided. In response to receiving a signal, a time-to-frequency domain transformation on at least a portion of the received signal to generate a frequency domain representation is performed. The time-to-frequency domain transformation converts the signal from a time domain representation to the frequency domain representation. The frequency domain representation may be divided into a plurality of frequency bands and an FDLP polynomial may be fitted to each of the plurality of frequency bands. Temporal envelopes may be extracted from each of the plurality of frequency bands using the fitted FDLP polynomial. Spectral envelopes may be constructed by taking simultaneous points in the temporal envelopes. A smooth envelope may be fitted to each of the spectral envelopes. Based on the temporal and spectral envelopes, one or more speech features are generated. This is sometimes referred to herein as “PLP² modeling.”

These methods and systems for modeling the temporal structure of the signal may be used to improve sound recognition (in particular, speech recognition), sound encoding and decoding, and artificial sound synthesis.

There has thus been outlined, rather broadly, the more important features of the invention in order that the detailed description thereof that follows may be better understood, and in order that the present contribution to the art may be better appreciated. There are, of course, additional features of the invention that will be described hereinafter and which will form the subject matter of the claims appended hereto.

In this respect, before explaining at least one embodiment of the invention in detail, it is to be understood that the invention is not limited in its application to the details of construction and to the arrangements of the components set forth in the following description or illustrated in the drawings. The invention is capable of other embodiments and of being practiced and carried out in various ways. Also, it is to be understood that the phraseology and terminology employed herein are for the purpose of description and should not be regarded as limiting.

As such, those skilled in the art will appreciate that the conception, upon which this disclosure is based, may readily be utilized as a basis for the designing of other structures, methods and systems for carrying out the several purposes of the present invention. It is important, therefore, that the claims be regarded as including such equivalent constructions insofar as they do not depart from the spirit and scope of the present invention.

These together with other objects of the invention, along with the various features of novelty which characterize the invention, are pointed out with particularity in the claims annexed to and forming a part of this disclosure. For a better understanding of the invention, its operating advantages and the specific objects attained by its uses, reference should be had to the accompanying drawings and descriptive matter in which there are illustrated preferred embodiments of the invention.

BRIEF DESCRIPTION OF THE DRAWINGS

Various objects, features, and advantages of the present invention can be more fully appreciated with reference to the following detailed description of the invention when considered in connection with the following drawings, in which like reference numerals identify like elements.

FIG. 1 is a simplified illustration of a spectrogram of a speech sample and a spectrogram of the discrete cosine transformation (DCT) of the speech sample in accordance with some embodiments of the present invention.

FIG. 2 is a simplified illustration of one example of a waveform and temporal envelopes of the waveform with various poles in accordance with some embodiments of the present invention.

FIG. 3 is a simplified illustration of one example of a subband frequency-domain linear prediction (FDLP) in accordance with some embodiments of the present invention.

FIG. 4 is a simplified illustration of one example of a waveform, a temporal envelope of the waveform modeled by FDLP, and a Gaussian window of the waveform in accordance with some embodiments of the present invention.

FIG. 5 is a simplified illustration of one example of a spectrogram of the speech sample, a per-frame maximum of the temporal envelope of the sample extracted in each band by FDLP, and sharpness index features in accordance with some embodiments of the present invention.

FIG. 6 shows the comparison between word-level confusion matrices in accordance with some embodiments of the present invention.

FIG. 7 is a simplified illustration of one example of a subband FDLP and one example of a PLP² in accordance with some embodiments of the present invention.

FIG. 8 is a simplified illustration of PLP² having pole locations in accordance with some embodiments of the present invention.

FIG. 9 shows the mean-squared differences between the log-magnitude surfaces obtained in successive iterations of the PLP² analysis in accordance with some embodiments of the present invention.

FIG. 10 is a simplified flowchart illustrating the steps performed in using frequency domain linear prediction to estimate the temporal envelope of a frequency domain representation in accordance with some embodiments of the present invention.

FIG. 11 is a simplified flowchart illustrating the steps performed in combining the temporal information extracted by FDLP with spectral information extracted by PLP to extract one or more speech features in accordance with some embodiments of the present invention.

FIG. 12 is a schematic diagram of an illustrative system suitable for implementation of an application that uses the temporal structure model in accordance with some embodiments of the present invention.

FIG. 13 is a detailed example of the server and one of the workstations that may be used in accordance with some embodiments of the present invention.

DETAILED DESCRIPTION OF THE PREFERRED EMBODIMENTS

The following description includes many specific details. The inclusion of such details is for the purpose of illustration only and should not be understood to limit the invention. Moreover, certain features which are well known in the art are not described in detail in order to avoid complication of the subject matter of the present invention. In addition, it will be understood that features in one embodiment may be combined with features in other embodiments of the invention.

In accordance with the present invention, computer implemented methods and systems are provided for representing and modeling the temporal structure of audio signals. More particularly, the methods and systems provide a compact representation of an audio signal that includes substantial detail about its temporal structure such that accurate modeling, classification, recognition, and/or resynthesis may be performed.

In some embodiments, a representation of the temporal envelope in different frequency bands is provided by exploring the dual of linear prediction when applied in the transform domain. With this technique of frequency domain linear prediction, the poles of the model describe temporal, rather than spectral, peaks. By using analysis windows on the order of hundreds of milliseconds, a processor may perform a procedure that automatically determines how to distribute poles or the pole rate to best model the temporal structure within the window. Features based on an index describing the sharpness of individual poles within a window are shown to provide a substantial improvement to the word error rate.

Using the representation of the temporal envelope, the processor may adaptively capture fine temporal nuances with millisecond accuracy while at the same time summarizing the signal's gross temporal evolution in timescales of about 500 milliseconds or more. Fine time-adaptive accuracy may be used to pinpoint significant moments in time such as, for example, those associated with transient events like stop bursts. At the same time, the long-timescale summarization power of temporal envelopes provides the ability to train, for example, speech recognizers on complete linguistic units lasting longer than 10 milliseconds and to learn acoustically-feasible phoneme sequences.

The representation of the temporal envelope of a signal is created generally by applying a discrete cosine transform (DCT) on long time frames and a frequency domain linear prediction (FDLP) on the output of the DCT.

The DCT generally appears as a post-processing step in feature extractors for automatic speech recognition. The forward DCT of an N-point real sequence x[n] may be defined as:

$X_{DCT}[k] = a[k] \sum_{n=0}^{N-1} x[n] \cos\left( \frac{(2n + 1)\pi k}{2N} \right)$

where k=0, 1, . . . , N−1 and

$a[k] = \begin{cases} 1 & k = 0 \\ \sqrt{2} & k = 1, 2, \ldots, N - 1 \end{cases}$

In some embodiments, the DCT may be used to approximate the envelope of the discrete Fourier transform (DFT). Denoting by $X_{DFT}[k]$ the DFT of a length-2N zero-padded version of x[n], it has been determined that the envelope of the DCT is bounded by the envelope of the zero-padded DFT, and that the two are related by:

$X_{DCT}[k] = a[k]\, |X_{DFT}[k]| \cos\left( \theta[k] - \frac{\pi k}{2N} \right)$

where k=0, 1, . . . , N−1, and $|X_{DFT}[k]|$ and $\theta[k]$ are the magnitude and phase of the zero-padded DFT, respectively.
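The relation between the DCT and the zero-padded DFT can be checked numerically. The following sketch uses an arbitrary random test sequence (an assumption made only for illustration) and compares the DCT computed from its definition with the value obtained from the magnitude and phase of the length-2N DFT.

```python
import numpy as np

rng = np.random.default_rng(0)
N = 16
x = rng.standard_normal(N)            # arbitrary real test sequence

n = np.arange(N)
k = np.arange(N)
a = np.where(k == 0, 1.0, np.sqrt(2.0))

# DCT computed directly from its definition
X_dct = a * (np.cos(np.outer(k, 2 * n + 1) * np.pi / (2 * N)) @ x)

# DFT of the length-2N zero-padded sequence, keeping bins k = 0..N-1
X_dft = np.fft.fft(x, 2 * N)[:N]
magnitude = np.abs(X_dft)
theta = np.angle(X_dft)

# DCT reconstructed from the DFT magnitude and phase
X_from_dft = a * magnitude * np.cos(theta - np.pi * k / (2 * N))
max_error = np.max(np.abs(X_dct - X_from_dft))
```

Because the cosine factor never exceeds one in magnitude, the same computation also illustrates the bound: each `abs(X_dct[k])` is at most `a[k] * magnitude[k]`.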

FIG. 1 is a simplified illustration of a spectrogram of a speech sample and a spectrogram of the discrete cosine transformation (DCT) of the speech sample in accordance with some embodiments of the present invention. As shown in FIG. 1, spectrogram 110 is of a 2 second speech sample and spectrogram 120 is of a DCT transform of the whole sample (treating the DCT output sequence as a sequence in time). It should be noted that while the DCT spectrogram 120 appears to be a mirror image of the regular spectrogram 110, it is not an exact mirror image because of the cosine modulating term in the above-mentioned equation.

FDLP, the frequency domain dual of the time domain linear prediction (TDLP), is the part of the model that provides the time adaptive behavior. TDLP is fully familiar to those of ordinary skill in the art. FDLP analysis estimates the temporal envelope of the signal and, in particular, the square of its Hilbert envelope:

$e(t) = \mathcal{F}^{-1}\left\{ \int \tilde{X}(\zeta)\, \tilde{X}^{*}(\zeta - f)\, d\zeta \right\}$

The squared Hilbert envelope is thus the inverse Fourier transform of the autocorrelation of the single sided (positive frequency) spectrum $\tilde{X}(f)$. The autocorrelation of the spectral coefficients may therefore be used to predict the temporal envelope of the signal.
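A discrete-time analogue of this identity can be verified numerically. The sketch below assumes a finite, even-length sequence and circular (DFT) conventions, so it illustrates the relationship rather than reproducing the continuous-time expression exactly. It forms the single sided spectrum of a random test signal, computes the squared magnitude of the analytic signal directly, and compares it with the inverse DFT of the circular autocorrelation of that spectrum.

```python
import numpy as np

rng = np.random.default_rng(1)
N = 64
x = rng.standard_normal(N)            # arbitrary real test signal

# Single sided (positive frequency) spectrum of the analytic signal
X = np.fft.fft(x)
h = np.zeros(N)
h[0] = 1.0                            # keep DC
h[1:N // 2] = 2.0                     # double the positive frequencies
h[N // 2] = 1.0                       # keep the Nyquist bin (N is even)
Xa = X * h

# Squared Hilbert envelope computed directly from the analytic signal
env_direct = np.abs(np.fft.ifft(Xa)) ** 2

# Circular autocorrelation of the spectrum: R[m] = sum_k Xa[k] * conj(Xa[k - m])
R = np.array([np.sum(Xa * np.conj(np.roll(Xa, m))) for m in range(N)])

# Inverse DFT of the spectral autocorrelation (the 1/N comes from the DFT scaling)
env_from_autocorr = np.fft.ifft(R / N).real
```

The two envelopes agree to numerical precision, which is the property FDLP exploits: a linear predictor fitted to the spectral autocorrelation models the temporal envelope.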

In some embodiments, the frequency domain linear prediction is performed by calculating the inverse Fourier transform of the magnitude-squared Fourier transform of a portion of the frequency domain representation raised to a given power. When the given power is 1, the autocorrelation is calculated (as shown in the equation above). When the given power is not 1, the pseudoautocorrelation is calculated.
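One possible reading of this computation is sketched below. It is an assumption that the exponent is applied to the magnitude-squared transform, and the function name `pseudo_autocorrelation` and the zero-padding to twice the input length are illustrative choices rather than details taken from the description.

```python
import numpy as np

def pseudo_autocorrelation(coeffs, power=1.0, n_lags=None):
    """Inverse FFT of the magnitude-squared FFT of `coeffs`, raised to `power`.

    With power == 1 this equals the ordinary (linear) autocorrelation of the
    coefficients; other powers give a pseudoautocorrelation.
    """
    c = np.asarray(coeffs, dtype=float)
    N = len(c)
    n_lags = N if n_lags is None else n_lags
    # Zero-pad to 2N so the circular correlation equals the linear one
    spec = np.abs(np.fft.fft(c, 2 * N)) ** 2
    r = np.fft.ifft(spec ** power).real
    return r[:n_lags]

rng = np.random.default_rng(2)
c = rng.standard_normal(32)                       # stand-in for DCT coefficients
r_auto = pseudo_autocorrelation(c, power=1.0, n_lags=5)
r_pseudo = pseudo_autocorrelation(c, power=1.0 / 3.0, n_lags=5)
```

For a power of 1 the result matches the direct lag products of the coefficients, while a fractional power such as 1/3 compresses the dynamic range of the implied envelope before the predictor is fitted.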

FIG. 2 is a simplified illustration of one example of a waveform 210 and temporal envelopes 220, 230, and 240 of the waveform 210 with various poles 250 in accordance with some embodiments of the present invention. FIG. 2 shows a 256 millisecond long speech segment at an 8 kHz sample rate. After using the processor to take the 2048 point DCT of the whole sample, the processor fits a single FDLP polynomial to the DCT and then extracts the temporal envelope of the segment. Note that FIG. 2 shows the tradeoffs involved in model order selection (defining pole rate). When the processor generates an envelope 220 having 10 poles, the resulting envelope is too smooth and provides only a loose approximation. On the other hand, when the processor generates an envelope 240 having 40 poles, the resulting envelope is starting to fit the pitch pulses, which is generally something to avoid for English-language automatic speech recognition. When the processor generates an envelope 230 having 20 poles, this resulting envelope strikes a good balance as it captures both the gross variation and the stop burst transients in the beginning of the sample. Envelope 230 has a pole rate of 20 poles per 256 milliseconds or about 0.1 poles/ms. In accordance with some embodiments of the present invention, 20 poles per 256 milliseconds or a pole rate of 0.1 poles/millisecond is advantageously used in order to generate the model. It should be noted, however, that the poles are distributed adaptively within the 256 ms window, thereby providing flexibility to the model. It should also be noted that any suitable number of poles or pole rate may be determined and used by the processor.
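The full-band procedure described for FIG. 2 can be sketched as follows. This is a minimal illustration under stated assumptions, not the implementation used to produce the figure: the function names are hypothetical, the autocorrelation method with a direct solve of the normal equations stands in for whatever solver an embodiment uses, and the model is sampled at N points so that model "frequency" between 0 and π maps onto time within the window.

```python
import numpy as np

def dct_forward(x):
    """Forward DCT with a[0] = 1 and a[k] = sqrt(2) otherwise."""
    x = np.asarray(x, dtype=float)
    N = len(x)
    n = np.arange(N)
    a = np.full(N, np.sqrt(2.0))
    a[0] = 1.0
    return a * (np.cos((2 * n + 1) * np.pi * n[:, None] / (2 * N)) @ x)

def fdlp_envelope(x, order=20):
    """All-pole estimate of the squared temporal envelope of x (sketch)."""
    c = dct_forward(x)
    N = len(c)
    # Autocorrelation of the DCT coefficients for lags 0..order
    r = np.array([c[: N - m] @ c[m:] for m in range(order + 1)])
    # Normal equations R alpha = r[1:], with R the Toeplitz matrix of r[0..order-1]
    lags = np.abs(np.subtract.outer(np.arange(order), np.arange(order)))
    alpha = np.linalg.solve(r[lags], r[1:])
    a = np.concatenate(([1.0], -alpha))           # prediction polynomial A(z)
    gain = r[0] - alpha @ r[1:]                   # prediction error energy
    # Evaluate gain / |A|^2 at angles pi * n / N, n = 0..N-1 (one point per sample)
    A = np.fft.rfft(a, 2 * N)[:N]
    return gain / np.abs(A) ** 2, a

# Toy check: a single click at sample 300 of a 512-sample window
x = np.zeros(512)
x[300] = 1.0
envelope, poly = fdlp_envelope(x, order=2)
peak = int(np.argmax(envelope))
```

For the segment described above (2048 samples, 256 ms at 8 kHz), the corresponding call would be `fdlp_envelope(segment, order=20)`, where `segment` is a placeholder for the speech samples; lowering or raising `order` reproduces the smoothing tradeoff discussed for envelopes 220, 230, and 240.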

FIG. 3 is a simplified illustration of one example of a subband frequency-domain linear prediction (FDLP) in accordance with some embodiments of the present invention. In FIG. 3, the same 256 ms long sample is used, but the processor applies FDLP on four logarithmically-split octave bands 310, 320, 330, and 340. More particularly, each band represents a range of frequencies: 0-0.5, 0.5-1, 1-2, and 2-4 kHz, respectively. It should be noted that the same pole rate of 20 poles per 256 ms for each band is used. It should also be noted that the high frequency band is resolving the transient while the low frequency band is capturing the gross spectral variation. This approach is sometimes referred to herein as “subband FDLP.” By transforming longer 256 ms blocks of signal (which is extensible to seconds or more), enough variation is captured to manifest itself as significantly different temporal envelopes between bands.
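The logarithmic split can be illustrated with a short sketch that maps the four frequency ranges listed above onto ranges of DCT coefficients. The helper names are hypothetical, and the sketch assumes that the coefficient count is divisible by the required power of two and that the lowest band extends down to coefficient 0.

```python
import numpy as np

def octave_band_edges(n_coeffs, n_bands):
    """Coefficient indices bounding logarithmically split octave bands."""
    # Upper edges halve with each step down: N, N/2, N/4, ...; the lowest band starts at 0
    return [0] + [n_coeffs >> (n_bands - 1 - b) for b in range(n_bands)]

def split_into_bands(coeffs, n_bands):
    """Return one array of DCT coefficients per octave band, low to high."""
    edges = octave_band_edges(len(coeffs), n_bands)
    return [coeffs[lo:hi] for lo, hi in zip(edges[:-1], edges[1:])]

edges = octave_band_edges(2048, 4)
# Each coefficient index spans (sample_rate / 2) / n_coeffs Hz
edges_hz = [e * 4000.0 / 2048 for e in edges]
bands = split_into_bands(np.arange(2048.0), 4)
```

For 2048 coefficients at an 8 kHz sample rate this gives edges at 0, 500, 1000, 2000, and 4000 Hz. An FDLP polynomial would then be fitted to each element of `bands` separately.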

This approach provides a new parameter space from which features may be extracted for use in, for example, automatic speech recognition. There are many approaches in which the above-mentioned temporal envelope information modeled in FDLP may be converted into features for use in speech recognizers.

In some embodiments, the temporal envelopes may be used directly. The envelopes as shown in FIG. 2 are sampled DFTs of the impulse responses (IR) of the all-pole filters that have been fit to the frequency domain. The basic linear prediction may be suitable for direct transformation into temporal-based features such as modulation spectra. In addition, relationships such as the direct transformation from prediction coefficients to cepstra may provide decorrelated features describing the temporal behavior in different subbands.

In another suitable embodiment, features may be derived from each individual pole in the model (i.e., the roots of the predictor polynomial). The angle of the pole on the z-plane corresponds to accurate timing information and the magnitude of the pole may provide knowledge about the energy of the signal. It should be noted that this is a smoothed approximation to the true Hilbert envelope. The sharpness of the pole (i.e., how closely it approaches the unit circle) relates to the dynamics of the envelope. For example, a sharper pole indicates more rapid variation of the envelope at that time.

The index of sharpness $\hat{\rho}_{i}$ of the FDLP poles $\{\rho_{i}\}$ is defined by:

$\hat{\rho}_{i} = \frac{1}{1 - |\rho_{i}|}.$

As pole magnitudes grow from zero to approaching the unit circle, $\hat{\rho}_{i}$ grows from 1 to an unbounded large positive value.
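A minimal sketch of the sharpness computation is shown below, assuming the predictor polynomial is available as a coefficient vector with leading coefficient 1. The function name `pole_sharpness` is hypothetical, and the example polynomial is constructed from known roots purely to make the result easy to check.

```python
import numpy as np

def pole_sharpness(poly):
    """Return the poles of an all-pole model and their sharpness 1 / (1 - |pole|)."""
    poles = np.roots(poly)                    # roots of the predictor polynomial
    sharpness = 1.0 / (1.0 - np.abs(poles))   # grows without bound near the unit circle
    return poles, sharpness

# Example polynomial with known roots at 0.5 and 0.9
poly = np.poly([0.5, 0.9])                    # coefficients [1, -1.4, 0.45]
poles, sharpness = pole_sharpness(poly)
sorted_sharpness = np.sort(sharpness)
```

Here the pole at 0.5 has a sharpness of 2 and the pole at 0.9 has a sharpness of 10, which shows how quickly the index rises as a pole approaches the unit circle.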

For each analysis frame in time, the full DCT is taken and FDLP is performed on 4 log bands using 20 poles per band. The choice of a 256 ms analysis window (2048 samples at 8 kHz) is, without loss of generality, dictated by computational considerations. Subbands are formed by breaking up the DCT into subranges that are exact powers of two (e.g., 128, 256, 512 and 1024 points for a 4-way split). After modeling with 20 poles per band per frame, the processor calculates the sharpness index. The sharpness indices may be scaled using a Gaussian window 410 to achieve a finer time resolution than the 256 ms window, as illustrated in FIG. 4, and the maximum value in each band in each frame is retained. The purpose of the window is to localize the sharpness values in the vicinity of the center of the frame. FIG. 5 visually compares these pole sharpness features with direct measures of the subband energy. After examining the distributions of the sharpness parameters, a logarithmic transform was added to make the distributions closer to Gaussian, and thus a better match to the statistical models.

The temporal envelope modeled in FDLP was tested using a conventional HTK recognizer and Gaussian Mixture Model-Hidden Markov Model (GMM-HMM) models that are trained on a mixture of conversational and read speech using a combination of Switchboard, Callhome, and Macrophone databases.

TABLE 1
Recognition word error rate (WER) results.

Features            raw 20 k    pad 85 k
PLP12               4.97%       2.75%
FDLP-4log           4.08%       2.90%
FDLP-2log + dct     3.81%       2.82%
FDLP-3log + dct     -           2.61%
FDLP-4log + dct     -           2.63%
FDLP-5log + dct     -           2.69%
FDLP-8bark + dct    4.38%       -

Table 1 shows the recognition word error rate (WER) results. The first line, “PLP12”, is the baseline system employing 12th order PLP features (plus deltas and double deltas). Subsequent systems augment these features with FDLP sharpness features in various guises. “FDLP-4log” adds four elements to each feature vector, derived from 4 logarithmically-spaced octave subbands (e.g., 0-500 Hz, 500 Hz-1 kHz, 1-2 kHz, and 2-4 kHz). It should be noted that performing a final DCT decorrelation on each frame of FDLP features improved recognition, as shown in the “FDLP-Xlog+dct” lines. Between two and five octave bands (where two bands are 0-2 kHz and 2-4 kHz, and five bands extend down to 0 to 250 Hz) were used to find the best compromise between signal detail and model accuracy (since narrow frequency bands contain fewer frequency samples with which to estimate the linear prediction parameters). In some embodiments, dividing the frequency axis on a Bark scale, which is fully familiar to those of ordinary skill in the art, may be used to allow the use of more bands (since Bark bands do not get narrow so quickly in the low frequencies).

In some embodiments, padding each end of the test utterances with 100 ms of artificial background noise silence may be beneficial. In some embodiments, all test set utterances marked as coming from the same speaker may be normalized. Such changes may improve the WER from about 4.97% to about 2.75%.

For the “raw 20 k” system, it should be noted that any kind of FDLP-derived information improved word error rate, with the greatest improvement coming from augmenting the PLP features with decorrelated 4 octave-subband FDLP sharpness features (“FDLP-4log+dct”). The WER changed from 4.97% to 3.81%, which represents a 23.3% relative improvement. With the larger, better-performing “pad 85 k” system, the improvements from FDLP were smaller, with the best result reducing the baseline WER of 2.75% to 2.61% for 3 subband decorrelated features (“FDLP-3log+dct”), constituting a 5% relative improvement.
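For reference, the relative improvements quoted above follow from dividing the absolute WER reduction by the baseline WER:

$\frac{4.97 - 3.81}{4.97} = \frac{1.16}{4.97} \approx 0.233$

$\frac{2.75 - 2.61}{2.75} = \frac{0.14}{2.75} \approx 0.051$

These correspond to the 23.3% and approximately 5% relative improvements, respectively.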

FIG. 6 compares the word-level confusion matrices for the baseline “raw 20 k” PLP system and for the best performing “FDLP-4log+dct” system. Looking at the absolute differences in error counts (middle pane), the greatest differences are seen for the words “four” (fewer confusions with “forty”), “eight” and “six” (fewer deletions), and “five” (fewer confusions with “nine”). It should be noted that most of these main differences involve stops (/t/ in “eight” and “forty”, and /k/ in “six”), which is consistent with the initial motivation for the FDLP sharpness features, namely capturing information about short-duration transients in the speech signal.

Accordingly, FDLP analysis is advantageous because of its ability to describe temporal structure without frame-rate quantization, and its rich and flexible representation of temporal structure in the form of poles. This flexible, adaptive representation of the temporal structure may be analyzed across the full-band or for arbitrarily-spaced subbands, and presents many possibilities for advanced speech recognition features.

Perceptual Linear Prediction (PLP) is another auditory-based approach to feature extraction. In contrast to pure linear predictive analysis of speech, PLP generally uses several perceptually motivated transforms including Bark frequency, masking curves, etc. to modify the short-term spectrum of the speech. In accordance with some embodiments of the present invention, the temporal information extracted by FDLP may be combined with spectral information extracted using PLP.

As described above, a squared Hilbert envelope (the squared magnitude of the analytic signal) represents the total instantaneous energy in a signal, while the squared Hilbert envelopes of sub-band signals are a measure of the instantaneous energy in the corresponding sub-bands. Deriving these Hilbert envelopes generally involves either using a Hilbert operator in the time domain (which is difficult in practice because of its doubly-infinite impulse response), or the use of two Fourier transforms with modifications to the intermediate spectrum. Alternatively, an all-pole approximation of the Hilbert envelope may be calculated by computing a linear predictor for the positive half of the Fourier transform of an even-symmetrized input signal, which is equivalent to computing the predictor from the cosine transform of the signal. Such FDLP is the frequency-domain dual of the well-known TDLP. Similar to how the TDLP fits the power spectrum of an all-pole model to the power spectrum of a signal, FDLP fits a “power spectrum” of an all-pole model (e.g., in the time domain) to the squared Hilbert envelope of the input signal. To obtain such a model for a specific sub-band, the prediction may be based only on the corresponding range of coefficients from the original Fourier transform.

To summarize temporal dynamics, rather than capture every nuance of the temporal envelope, the all-pole approximation to the temporal trajectory offers parametric control over the degree to which the Hilbert envelope is smoothed (e.g., the number of peaks in the smoothed envelope cannot exceed half the order of the model).

Having an approach for estimating temporal envelopes in individual frequency bands of the original signal permits the construction of a spectrogram-like signal representation. Just as a typical spectrogram is constructed by appending individual short-term spectral vectors alongside each other, a similar representation may be constructed by vertical stacking of the temporal vectors approximating the individual sub-band Hilbert envelopes, recalling the outputs of the separate band-pass filters used to construct the original, analog Spectrograph. This is shown in FIG. 7. The top panel 710 shows the time-frequency pattern obtained by short-term Fourier transform analysis and Bark scale energy binning to 15 critical bands, which is the way the short-term critical-band spectrum is derived in PLP feature extraction. The second panel 720 shows the result of PLP smoothing, with each 15-point vertical spectral slice now smooth and continuous as a result of being fit with a linear prediction model. The third panel 730 is based on a series of 24-pole FDLP models, one for each Bark band, to give estimates of the 15 subband squared Hilbert envelopes. Similar to PLP, cube-root compression is applied here to the sub-band Hilbert envelope prior to computing the all-pole model of the temporal trajectory. The similarity of all these patterns is evident, but there are also some important differences: Whereas the binned, short-time spectrogram is ‘blocky’ in both time and frequency, the PLP model gives a smooth, continuous spectral profile at each time step. Conversely, the temporal evolution of the spectral energy in each sub-band is much smoother in the all-pole FDLP representation, constrained by the implicit properties of the temporal all-pole model.

In PLP, an auditory-like critical-band spectrum, obtained as the weighted summation of the short-term Fourier spectrum followed by cube-root amplitude compression, is approximated by an all-pole model in an approach similar to the way that linear prediction techniques approximate the linear-frequency short-term power spectrum of a signal. Subband FDLP offers an alternative approach to estimate the energy in each critical band as a function of time, raising the possibility of replacing the short-term critical band spectrum in PLP with this new estimate. In doing so, a new representation of the critical-band time-frequency plane is obtained. However, in contrast to the subband FDLP spectrotemporal pattern (which is constrained by the all-pole model along the temporal axis), the all-pole constraint in this new representation is along the spectral dimension of the pattern.

In some embodiments, the processor may repeat the processing along the temporal dimension of the new representation to enforce the all-pole constraints along the time axis. The outcome of such processing may be subject to another stage of all-pole modeling on the spectral axis. It should be noted that this alternation may be iterated until the difference between successive representations is negligible.

As a result, the processor provides a two-dimensional spectro-temporal auditory-motivated pattern that is constrained by all-pole models along both the time and frequency axes. This is sometimes referred to herein as “Perceptual Linear Prediction Squared” or “PLP².” The perceptual constraints are derived from the use of a critical-band frequency axis and from the use of a 250 ms critical-timespan interval, whereas the linear prediction (LP) portion indicates the use of all-pole modeling and the “squared” portion comes from the use of all-pole models along both the time and frequency axes.

In response to the processor taking the DCT of a 250 ms speech segment (equivalent to the Fourier transform of the related 500 ms even symmetric signal) at a sampling rate of 8 kHz, about 2000 unique values in the frequency domain are generated. The processor may then divide these into 15 bands with overlapping Gaussian windows whose widths and spacing select frequency regions of approximately one Bark, and apply 24th order FDLP separately on each of the 15 bands such that each predictor approximates the squared Hilbert envelope of the corresponding sub-band. The processor computes the critical-band time-frequency pattern within the 250 ms time span by sampling each all-pole envelope at 240 points (i.e., every 1.04 ms) and stacks the temporal trajectories vertically. This provides a 2-dimensional array amounting to a spectrogram, but constructed row-by-row, rather than column-by-column as in conventional short-term analysis. This time-frequency pattern is the starting point for further processing.
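The band division described above can be sketched as follows. This is a minimal illustration rather than the implementation of the invention: the Bark warping formula (the one commonly used in PLP analysis), the window standard deviation `sigma_bark`, the placement of the band centers, and the function names are all assumptions introduced here, since the text only states that the windows select regions of approximately one Bark.

```python
import math

def hz_to_bark(f_hz):
    # Bark warping commonly used in PLP analysis (assumed; the text names no formula).
    return 6.0 * math.asinh(f_hz / 600.0)

def bark_gaussian_windows(n_bins=2000, sample_rate=8000.0, n_bands=15, sigma_bark=0.5):
    """Return n_bands lists of n_bins weights, one Gaussian window per band."""
    # Bin k of an n_bins-point DCT sits at k * sample_rate / (2 * n_bins) Hz.
    bin_bark = [hz_to_bark(k * sample_rate / (2.0 * n_bins)) for k in range(n_bins)]
    spacing = hz_to_bark(sample_rate / 2.0) / n_bands   # roughly one Bark at 8 kHz
    windows = []
    for b in range(n_bands):
        center = (b + 0.5) * spacing                    # assumed center placement
        windows.append([math.exp(-0.5 * ((z - center) / sigma_bark) ** 2)
                        for z in bin_bark])
    return windows

n_bins = int(0.250 * 8000)        # 250 ms at 8 kHz gives 2000 DCT values
step_ms = 250.0 / 240             # spacing of the 240 envelope samples, about 1.04 ms
windows = bark_gaussian_windows(n_bins)
peaks = [w.index(max(w)) for w in windows]   # DCT bin where each window peaks
```

Multiplying the DCT coefficients by one of these windows would select the sub-band on which a separate FDLP model is then fit.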

In response to generating the above-mentioned time-frequency pattern, the processor may compute 240 12th-order time-domain LP (TDLP) models to model the spectra constituted by the 15 amplitude values in a vertical slice from the pattern at each of the 240 temporal sample points. The spectral envelopes of these models are each sampled at 120 points (i.e., every 0.125 Bark) and stacked next to each other to form a new 240×120=28,800 point spectro-temporal pattern. Each horizontal slice of 240 points is modeled by the same process of mapping a compressed magnitude “spectrum” to an autocorrelation and then to an all-pole model, thereby yielding 120 24th-order FDLP approximations to the temporal trajectories in the new fractional-Bark subbands. Sampling these models on the same 240 point grid gives the next iteration of the 28,800 point spectro-temporal pattern. The process may then repeat until it converges, where the number of iterations required for convergence appears to depend on the model orders as well as the compression factor in the all-pole modeling process. The mean-squared difference between the logarithmic surfaces of the successive spectro-temporal patterns as a function of the iteration number is shown in FIG. 9, which shows stabilization after 10 iterations in this example. (Although this plot shows that the differences between successive iterations do not decline all the way to zero, it should be noted that the residual changes in later iterations are immaterial; inspection of the time-frequency distribution of these differences reveals no significant structure.)
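The alternation and its stopping measure can be sketched as below. The sketch shows only the control flow and the mean-squared log difference between successive patterns; it assumes the pattern is already on its final grid, and `smooth_1d` is a simple stand-in smoother introduced for illustration, not the all-pole fit described above. All names, the tolerance, and the iteration cap are hypothetical.

```python
import math

def mean_sq_log_diff(a, b):
    """Mean-squared difference between the logarithms of two equal-size patterns."""
    diffs = [(math.log(x) - math.log(y)) ** 2
             for row_a, row_b in zip(a, b) for x, y in zip(row_a, row_b)]
    return sum(diffs) / len(diffs)

def transpose(p):
    return [list(col) for col in zip(*p)]

def smooth_1d(v):
    # Stand-in smoother (3-point average with clamped edges), NOT an all-pole fit.
    n = len(v)
    return [(v[max(i - 1, 0)] + v[i] + v[min(i + 1, n - 1)]) / 3.0 for i in range(n)]

def alternate_until_stable(pattern, fit_1d, tol=1e-9, max_iter=200):
    """Alternate fits along frequency and time; rows are sub-bands, columns are times."""
    history = []
    current = pattern
    for _ in range(max_iter):
        # Fit along frequency: each column (time step) is one spectral slice.
        spectral = transpose([fit_1d(col) for col in transpose(current)])
        # Fit along time: each row (sub-band) is one temporal trajectory.
        updated = [fit_1d(row) for row in spectral]
        history.append(mean_sq_log_diff(updated, current))
        current = updated
        if history[-1] < tol:
            break
    return current, history

pattern = [[1.0, 4.0, 2.0, 8.0, 1.0, 3.0],
           [2.0, 1.0, 6.0, 2.0, 5.0, 1.0],
           [7.0, 3.0, 1.0, 4.0, 2.0, 9.0],
           [1.0, 5.0, 3.0, 1.0, 6.0, 2.0]]
final, history = alternate_until_stable(pattern, smooth_1d)
```

The `history` list plays the role of the curve in FIG. 9. With real all-pole fits in place of the stand-in, a fixed iteration count could be used instead of a tolerance, since the text notes that the differences need not decline all the way to zero.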

The final panel 740 of FIG. 7 shows the results of the new PLP² compared with conventional PLP. The increased temporal resolution in comparison with the 10 ms sampled PLP (second panel 720) is very clear; the second important property of the PLP² surface is the increased spectral resolution in comparison with the 15 frequency values at each time for the basic FDLP model (third panel 730).

In some embodiments, further insight may be obtained by plotting the pole locations on the time-frequency plane. As shown in FIG. 8, the pole locations may be superimposed on a grayscale version of the PLP² pattern presented in the fourth panel of FIG. 7. Red dots show the 12 FDLP poles for each of the 120 subband envelope estimates. Due to the dense frequency sampling, the poles of adjacent bands are close in value, and the dots merge into near-vertical curves in the figure. Blue dots are the 6 TDLP poles at each of the 240 temporal sample points, and merge into near-horizontal lines.

The blue TDLP poles track the smoothed formants in the t=0.14 to 0.24 s region but fail to capture the transient at around 0.08 s. The red FDLP poles, on the other hand, with their emphasis on temporal modeling, make an accurate description of this transient. As expected, neither the TDLP nor the FDLP models track any energy peaks in the quiet region between 0 and 0.08 s. But, while the TDLP models for these temporal slices are obliged to place their poles somewhere in this region, the FDLP models are free to shift the majority of their poles into the later portion of the time window, between 0.08 and 0.25 s, where the bulk of the energy lies.

In some embodiments, after receiving a 250 ms segment of speech, the processor divides its DCT into 15 Bark bands. Each band is fit with a 12th order FDLP polynomial, and the resulting smoothed temporal envelope is sampled on a 10 ms grid. The central-most spectral slices are then smoothed across frequency using the conventional PLP technique. However, it should be noted that no further iterations are performed. Instead, the cepstra resulting from this stage are taken as replacements for the conventional PLP features and input to the recognizer.

Thus far, these features have shown performance very close to that of standard PLP features, achieving word error rates that differ by less than 2% relative. Although small, these differences are statistically very significant, and when the results from a PLP² system are combined with conventional system outputs using simple word-level voting, a significant improvement in overall accuracy is achieved.

Alternatively, techniques for reducing the smoothed energy surface to a lower-dimensional description appropriate for statistical classifiers include conventional basis decompositions such as Principal Component Analysis or two-dimensional DCTs.
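As one illustration of the two-dimensional DCT option, the sketch below transforms a small pattern along both axes and keeps only the low-order corner of coefficients. The unnormalized DCT-II, the function names, and the choice of how many coefficients to keep are assumptions made for this example.

```python
import math

def dct_1d(values):
    """Unnormalized DCT-II of a list of numbers."""
    n = len(values)
    return [sum(v * math.cos(math.pi * (i + 0.5) * k / n) for i, v in enumerate(values))
            for k in range(n)]

def dct_2d_lowpass(pattern, keep_rows, keep_cols):
    """2-D DCT of a pattern, truncated to the keep_rows x keep_cols low-order block."""
    # Transform along each row (the time axis of a sub-band trajectory).
    row_t = [dct_1d(row) for row in pattern]
    n_rows = len(pattern)
    # Transform along each retained column (the frequency axis).
    cols = [dct_1d([row_t[r][c] for r in range(n_rows)]) for c in range(keep_cols)]
    return [[cols[c][k] for c in range(keep_cols)] for k in range(keep_rows)]

flat = [[1.0] * 6 for _ in range(4)]       # a featureless 4 x 6 surface
coeffs = dct_2d_lowpass(flat, keep_rows=2, keep_cols=3)
```

For the flat surface only the first coefficient is non-zero, which shows how smooth surfaces concentrate in a few low-order terms.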

In another suitable embodiment, the pole locations may be viewed as a reduced, parametric description of the energy concentrations. For example, recording the crossing points of the nearly-continuous time and frequency pole trajectories may provide a highly compact description of the principal energy peaks in each 250 ms spectro-temporal window.

Accordingly, a new modeling scheme to describe the time and frequency structure in short segments of sound is provided. This approach of all-pole (linear predictive) modeling, applied in both the time and frequency domains, allows one to smooth this representation to adaptively preserve the most significant peaks within this window in both dimensions.

FIGS. 10 and 11 are generalized flow charts illustrating the steps performed in the modeling and representing of the temporal structure of audio signals in accordance with some embodiments of the present invention. It will be understood that the steps shown in these figures may be performed in any suitable order, some may be deleted, and others added.

FIG. 10 is a simplified flow chart illustrating the steps performed in extracting speech features from signals by using FDLP in accordance with some embodiments of the present invention. At step 1010, the process may receive a signal (e.g., a waveform). In response to receiving the signal, a time-to-frequency domain transformation is performed at step 1020 on at least a portion of the received signal to generate a frequency domain representation. The time-to-frequency domain transformation converts the signal from a time domain representation to the frequency domain representation. In some embodiments, the time-to-frequency domain transformation is performed by applying a discrete cosine transform (DCT) or a discrete Fourier transform on the portion of the received signal.

At step 1030, the processor may divide the frequency domain representation into a plurality of frequency bands. For example, subbands may be formed by breaking up the frequency domain representation into subranges. These subbands may be determined by logarithmically splitting the frequency domain representation into the plurality of frequency bands.
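A logarithmic split of this kind might look like the following sketch. The geometric spacing of the band edges, the `lowest_edge` parameter (the upper edge of the first band), and the function names are illustrative assumptions; the text does not specify how the logarithmic boundaries are chosen.

```python
def log_band_edges(n_bins, n_bands, lowest_edge):
    """Bin indices bounding n_bands sub-ranges with geometrically spaced upper edges."""
    edges = [0]
    for i in range(n_bands):
        # Upper edges grow geometrically from lowest_edge up to n_bins.
        edge = lowest_edge * (n_bins / lowest_edge) ** (i / (n_bands - 1))
        edges.append(int(round(edge)))
    return edges

def split_into_bands(coeffs, edges):
    """Slice a list of frequency-domain coefficients into consecutive sub-bands."""
    return [coeffs[lo:hi] for lo, hi in zip(edges[:-1], edges[1:])]

edges = log_band_edges(n_bins=2000, n_bands=15, lowest_edge=20)
bands = split_into_bands(list(range(2000)), edges)
```

Here the bands are non-overlapping slices for simplicity; overlapping windows could be used instead.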

At step 1040, the processor may perform a frequency domain linear prediction (FDLP) on each of the frequency bands by, for example, fitting an FDLP polynomial. The frequency domain linear prediction is performed by estimating the square of the Hilbert envelope of the signal or calculating the inverse Fourier transform of the magnitude-squared Fourier transform of a portion of the frequency domain representation raised to a given power. When the given power is 1, the autocorrelation of the single-sided (positive frequency) spectrum is calculated. Alternatively, when the given power is not 1, the pseudo-autocorrelation is calculated. The autocorrelation of the spectral coefficients may be used to predict the temporal envelope of the signal.
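The core of this step can be sketched for the case where the given power is 1. The sketch takes a DCT of the signal, autocorrelates the DCT coefficients, solves for the predictor, and evaluates the all-pole envelope over time. The unnormalized DCT-II, the Levinson-Durbin recursion, the full-band fit (no sub-band split), and all function names are assumptions made for illustration; the text specifies only that the autocorrelation of the spectral coefficients is used.

```python
import math

def dct_ii(x):
    """Unnormalized DCT-II, computed directly (fine for short segments)."""
    n = len(x)
    return [sum(v * math.cos(math.pi * (i + 0.5) * k / n) for i, v in enumerate(x))
            for k in range(n)]

def autocorrelation(c, max_lag):
    return [sum(c[k] * c[k + m] for k in range(len(c) - m)) for m in range(max_lag + 1)]

def levinson_durbin(r, order):
    """Solve the LP normal equations; returns (a, err) with a[0] == 1."""
    a = [1.0] + [0.0] * order
    err = r[0]
    for i in range(1, order + 1):
        acc = r[i] + sum(a[j] * r[i - j] for j in range(1, i))
        k = -acc / err                      # reflection coefficient
        prev = a[:]
        for j in range(1, i):
            a[j] = prev[j] + k * prev[i - j]
        a[i] = k
        err *= (1.0 - k * k)                # prediction error after this order
    return a, err

def fdlp_envelope(x, order):
    """All-pole estimate of the temporal envelope of x, one value per sample."""
    n = len(x)
    a, err = levinson_durbin(autocorrelation(dct_ii(x), order), order)
    env = []
    for t in range(n):
        w = math.pi * (t + 0.5) / n         # time index mapped to angle
        re = sum(a[j] * math.cos(w * j) for j in range(order + 1))
        im = -sum(a[j] * math.sin(w * j) for j in range(order + 1))
        env.append(err / (re * re + im * im))
    return env

click = [0.0] * 64
click[20] = 1.0                             # a single transient at sample 20
envelope = fdlp_envelope(click, order=2)
peak = envelope.index(max(envelope))
```

Because linear prediction is applied to frequency-domain coefficients, the poles mark locations in time, so the envelope of the click peaks at or next to sample 20.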

In some embodiments, the frequency domain linear prediction may include selecting a temporal window to apply the linear prediction and automatically determining a pole rate to distribute poles for modeling the temporal envelope. The poles generally characterize the temporal peaks of the temporal envelope. The pole rate may be automatically determined to capture both gross variation and stop burst transients of the signal.

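In some embodiments, an index of sharpness may be extracted from each of the poles. The sharpness of the pole relates to the dynamics of the temporal envelope. The index of sharpness $\hat{\rho}_{i}$ of the FDLP poles $\{\rho_{i}\}$ is defined as

$\hat{\rho}_{i} = \frac{1}{1 - |\rho_{i}|},$

where $|\rho_{i}|$ denotes the magnitude of the i-th pole, so that poles lying closer to the unit circle yield a larger index.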
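A minimal sketch of this index, assuming that the quantity entering the denominator is the magnitude of the (generally complex) pole and that the poles are stable; the function name is hypothetical.

```python
import cmath

def sharpness_index(pole):
    """Index of sharpness 1 / (1 - |pole|) for a pole inside the unit circle."""
    magnitude = abs(pole)
    if magnitude >= 1.0:
        raise ValueError("expected a stable pole inside the unit circle")
    return 1.0 / (1.0 - magnitude)

broad = sharpness_index(cmath.rect(0.5, 1.0))    # pole of magnitude 0.5
sharp = sharpness_index(cmath.rect(0.9, 0.3))    # pole of magnitude 0.9
```

A pole of magnitude 0.5 gives an index of about 2, while a pole of magnitude 0.9 gives an index of about 10, reflecting a narrower temporal peak.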

Temporal envelopes may be extracted from each of the plurality of frequency bands using the fitted FDLP polynomial at step 1050.

In some embodiments, the temporal envelope may be used to generate at least one speech feature. Speech features may be used for sound recognition (in particular, speech recognition), sound encoding and decoding, and artificial sound synthesis.

FIG. 11 is a simplified flowchart illustrating the steps performed in combining the temporal information extracted by FDLP with spectral information extracted by PLP to extract one or more speech features in accordance with some embodiments of the present invention. In response to extracting temporal envelopes from the audio signal using FDLP, the process may construct spectral envelopes by, for example, taking simultaneous points in the temporal envelopes (step 1110). For example, the processor may compute time-domain linear prediction models to model the spectra constituted by the points in the temporal envelopes. In some embodiments, the processor may iterate the fitting in frequency and time domains.

A smooth envelope may be fitted to each of the spectral envelopes at step 1120. The smoothing of the spectral envelopes may be achieved by fitting a linear prediction polynomial to each of the spectral envelopes. This may be performed by calculating the inverse Fourier transform of the Fourier transform magnitude of the spectral envelope raised to a given power. In some embodiments, the spectral envelopes may be modified by a nonlinear warping of the frequency axis and/or the time axis.
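The mapping from a compressed spectral envelope to the autocorrelation values needed for the linear prediction fit can be sketched as follows. The sketch assumes the envelope samples are power values on an evenly spaced grid from 0 to pi that includes both end points, and it uses an inverse cosine transform of the even-symmetric spectrum; the function name and the `exponent` parameter are hypothetical.

```python
import math

def autocorr_from_spectrum(power, max_lag, exponent=1.0):
    """Autocorrelation lags 0..max_lag from spectrum samples raised to a power.

    power[k] is assumed to sit at angle pi * k / (K - 1) for k = 0..K-1.
    """
    k_max = len(power) - 1
    p = [v ** exponent for v in power]      # amplitude compression
    r = []
    for m in range(max_lag + 1):
        # End points appear once in the even-symmetric spectrum, interior points twice.
        total = p[0] + p[-1] * math.cos(math.pi * m)
        total += 2.0 * sum(p[k] * math.cos(math.pi * k * m / k_max)
                           for k in range(1, k_max))
        r.append(total / (2.0 * k_max))
    return r

flat_slice = [4.0] * 15                     # a featureless 15-band spectral slice
lags = autocorr_from_spectrum(flat_slice, max_lag=12, exponent=0.5)
```

For a flat slice all lags beyond zero vanish, so an all-pole fit would have nothing to model. The resulting lags could be passed to any standard solver of the linear prediction normal equations.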

Based on both the temporal and spectral envelopes, one or more speech features are generated at step 1130. Speech features may be used for sound recognition (e.g., speech recognition), sound encoding and decoding, and artificial sound synthesis. For example, an ASR system may be tuned for various speech recognition tasks by using the improved speech features generated by the methods and systems of the present invention. Some applications in which such an ASR system with improved speech modeling may be used are, for example, cellular telephones (e.g., automatic dialing in response to receiving a voice command), telephone directories, software for operating a computer, data entry, automobile controls, etc.

FIG. 12 is a schematic diagram of an illustrative system 1200 suitable for implementation of an application that generates and uses the temporal structure model for speech recognition, sound encoding, sound decoding, and sound synthesis in accordance with some embodiments of the present invention. Referring to FIG. 12, an exemplary system 1200 for implementing the present invention is shown. As illustrated, system 1200 may include one or more workstations 1202. Workstations 1202 may be local to each other or remote from each other, and are connected by one or more communications links 1204 to a communications network 1206 that is linked via a communications link 1208 to a server 1210.

In system 1200, server 1210 may be any suitable server for providing access to the application or to the temporal structure model, such as a processor, a computer, a data processing device, or a combination of such devices. Communications network 1206 may be any suitable computer network including the Internet, an intranet, a wide-area network (WAN), a local-area network (LAN), a wireless network, a digital subscriber line (DSL) network, a frame relay network, an asynchronous transfer mode (ATM) network, a virtual private network (VPN), or any combination of any of the same. Communications links 1204 and 1208 may be any communications links suitable for communicating data between workstations 1202 and server 1210, such as network links, dial-up links, wireless links, hard-wired links, etc. Workstations 1202 enable a user to access features using the temporal structure model. Workstations 1202 may be personal computers, laptop computers, mainframe computers, dumb terminals, data displays, Internet browsers, personal digital assistants (PDAs), two-way pagers, wireless terminals, portable telephones, etc., or any combination of the same.

The server and one of the workstations, which are depicted in FIG. 12, are illustrated in more detail in FIG. 13. Referring to FIG. 13, workstation 1202 may include processor 1302, display 1304, input device 1306, and memory 1308, which may be interconnected. In a preferred embodiment, memory 1308 contains a storage device for storing a workstation program for controlling processor 1302. Memory 1308 also preferably contains the application according to the invention.

In some embodiments, the application may include an application program interface (not shown), or alternatively, as described above, the application may be resident in the memory of workstation 1202 or server 1210. In another suitable embodiment, the only distribution to the user may be a Graphical User Interface which allows the user to interact with the application resident at, for example, server 1210.

In one particular embodiment, the application may include client-side software, hardware, or both. For example, the application may encompass one or more Web-pages or Web-page portions (e.g., via any suitable encoding, such as HyperText Markup Language (HTML), Dynamic HyperText Markup Language (DHTML), Extensible Markup Language (XML), JavaServer Pages (JSP), Active Server Pages (ASP), Cold Fusion, or any other suitable approaches).

Although the application is described herein as being implemented on a workstation, this is only illustrative. The application may be implemented on any suitable platform (e.g., a personal computer (PC), a mainframe computer, a dumb terminal, a data display, a two-way pager, a wireless terminal, a portable telephone, a portable computer, a palmtop computer, an H/PC, an automobile PC, a laptop computer, a personal digital assistant (PDA), a combined cellular phone and PDA, etc.) to provide such features.

Processor 1302 uses the workstation program to present on display 1304 the application and the data received through communication link 1204 and commands and values transmitted by a user of workstation 1202. Input device 1306 may be a computer keyboard, a cursor-controller, a microphone, a dial, a switchbank, lever, or any other suitable input device as would be used by a designer of input systems or process control systems.

Server 1210 may include processor 1320, display 1322, input device 1324, and memory 1326, which may be interconnected. In a preferred embodiment, memory 1326 contains a storage device for storing data received through communication link 1208 or through other links, and also receives commands and values transmitted by one or more users. The storage device further contains a server program for controlling processor 1320.

It will also be understood that the detailed description herein may be presented in terms of program procedures executed on a computer or network of computers. These procedural descriptions and representations are the means used by those skilled in the art to most effectively convey the substance of their work to others skilled in the art.

A procedure is here, and generally, conceived to be a self-consistent sequence of steps leading to a desired result. These steps are those requiring physical manipulations of physical quantities. Usually, though not necessarily, these quantities take the form of electrical or magnetic signals capable of being stored, transferred, combined, compared and otherwise manipulated. It proves convenient at times, principally for reasons of common usage, to refer to these signals as bits, values, elements, symbols, characters, terms, numbers, or the like. It should be noted, however, that all of these and similar terms are to be associated with the appropriate physical quantities and are merely convenient labels applied to these quantities.

Further, the manipulations performed are often referred to in terms, such as adding or comparing, which are commonly associated with mental operations performed by a human operator. No such capability of a human operator is necessary, or desirable in most cases, in any of the operations described herein which form part of the present invention; the operations are machine operations. Useful machines for performing the operation of the present invention include general purpose digital computers or similar devices.

The present invention also relates to apparatus for performing these operations. This apparatus may be specially constructed for the required purpose or it may comprise a general purpose computer as selectively activated or reconfigured by a computer program stored in the computer. The procedures presented herein are not inherently related to a particular computer or other apparatus. Various general purpose machines may be used with programs written in accordance with the teachings herein, or it may prove more convenient to construct more specialized apparatus to perform the required method steps. The required structure for a variety of these machines will appear from the description given.

The system according to the invention may include a general purpose computer, or a specially programmed special purpose computer. The user may interact with the system via, e.g., a personal computer or PDA, over, e.g., the Internet, an intranet, etc. Either of these may be implemented as a distributed computer system rather than a single computer. Similarly, the communications link may be a dedicated link, a modem over a POTS line, the Internet and/or any other method of communicating between computers and/or users. Moreover, the processing could be controlled by a software program on one or more computer systems or processors, or could even be partially or wholly implemented in hardware.

Although a single computer may be used, the system according to one or more embodiments of the invention is optionally suitably equipped with a multitude or combination of processors or storage devices. For example, the computer may be replaced by, or combined with, any suitable processing system operative in accordance with the concepts of embodiments of the present invention, including sophisticated calculators, hand held, laptop/notebook, mini, mainframe and super computers, as well as processing system network combinations of the same. Further, portions of the system may be provided in any appropriate electronic format, including, for example, provided over a communication line as electronic signals, provided on CD and/or DVD, provided on optical disk memory, etc.

Any presently available or future developed computer software language and/or hardware components can be employed in such embodiments of the present invention. For example, at least some of the functionality mentioned above could be implemented using Visual Basic, C, C++ or any assembly language appropriate in view of the processor being used. It could also be written in an object oriented and/or interpretive environment such as Java and transported to multiple destinations to various users.

It is to be understood that the invention is not limited in its application to the details of construction and to the arrangements of the components set forth in the following description or illustrated in the drawings. The invention is capable of other embodiments and of being practiced and carried out in various ways. Also, it is to be understood that the phraseology and terminology employed herein are for the purpose of description and should not be regarded as limiting.

As such, those skilled in the art will appreciate that the conception, upon which this disclosure is based, may readily be utilized as a basis for the designing of other structures, methods and systems for carrying out the several purposes of the present invention. It is important, therefore, that the claims be regarded as including such equivalent constructions insofar as they do not depart from the spirit and scope of the present invention.

Although the present invention has been described and illustrated in the foregoing exemplary embodiments, it is understood that the present disclosure has been made only by way of example, and that numerous changes in the details of implementation of the invention may be made without departing from the spirit and scope of the invention, which is limited only by the claims which follow.

The following references are incorporated by reference herein in their entireties:

-   M. Athineos and D. P. W. Ellis, “Sound texture modeling with linear prediction in both time and frequency domains,” in Proc. ICASSP, 2003, vol. 5, pp. 648-651.
-   H. Hermansky and S. Sharma, “Temporal patterns (TRAPs) in ASR of noisy speech,” in Proc. ICASSP, March 1999, vol. 1, pp. 289-292.
-   H. Hermansky and N. Morgan, “RASTA processing of speech,” in Trans. Speech and Audio Processing, October 1994, vol. 2:4, pp. 578-589.
-   J. Tribolet and R. Crochiere, “Frequency domain coding of speech,” in Trans. ASSP, October 1979, vol. 27, pp. 512-530.
-   J. Herre and J. D. Johnston, “Enhancing the Performance of Perceptual Audio Coders by Using Temporal Noise Shaping (TNS),” in Proc. 101st AES Conv., November 1996.
-   L. Rabiner and R. Schafer, Digital processing of speech signals, Prentice Hall, 1978.
-   Ozgur Cetin and Mari Ostendorf, “Cross-stream observation dependencies for multi-stream speech recognition,” in Eurospeech, Geneva, 2003.
-   P. Somervuo, B. Chen, and Q. Zhu, “Feature transformations and combinations for improving ASR performance,” in Eurospeech, Geneva, 2003.
-   H. Hermansky, H. Fujisaki, and Y. Sato, “Analysis and synthesis of speech based on spectral transform linear predictive method,” in Proc. ICASSP, April 1983, vol. 8, pp. 777-780.
-   S. Sharma, H. Versnel, and N. Kowalski, “Ripple analysis in ferret primary auditory cortex: 1. Response characteristics of single units to sinusoidally rippled spectra,” Aud. Neurosci., vol. 1, 1995.
-   D. Klein, D. Depireux, J. Simon, and S. Sharma, “Robust spectro-temporal reverse correlation for the auditory system: Optimizing stimulus design,” J. Comput. Neurosci., vol. 9, 2000.
-   H. Hermansky, “Exploring temporal domain for robustness in speech recognition,” in Proc. of 15th International Congress on Acoustics, vol. 11, Trondheim, Norway, June 1995.
-   H. Hermansky, “Should recognizers have ears?” Speech Communication, vol. 25, 1998.
-   H. Hermansky and S. Sharma, “TRAPS - classifiers of temporal patterns,” in Proc. ICSLP, Sydney, Australia, 1998.
-   P. Jain and H. Hermansky, “Beyond a single critical band in TRAP based ASR,” in Proc. Eurospeech, Geneva, Switzerland, November 2003.
-   S. Makino, T. Kawabata, and K. Kido, “Recognition of consonant based on the perception model,” in Proc. ICASSP, Boston, Mass., 1983.
-   P. Brown, “The acoustic-modeling problem in automatic speech recognition,” Ph.D. dissertation, Computer Science Department, Carnegie Mellon University, 1987.
-   H. Hermansky, D. Ellis, and S. Sharma, “Connectionist feature extraction for conventional hmm systems,” in Proc. ICASSP, Istanbul, Turkey, 2000.
-   M. Fanty and R. Cole, “Spoken letter recognition,” in Advances in Neural Information Processing Systems 3, Morgan Kaufmann Publishers, Inc., 1990.
-   S. Sharma, D. Ellis, S. Kajarekar, P. Jain, and H. Hermansky, “Feature extraction using non-linear transformation for robust speech recognition on the AURORA data-base,” in Proc. ICASSP, Istanbul, Turkey, 2000.
-   P. Schwartz, P. Matejka, and J. Cernocky, “Recognition of phoneme strings using TRAP technique,” in Proc. Eurospeech, Geneva, Switzerland, September 2003.
-   M. Athineos and D. Ellis, “Frequency-domain linear prediction for temporal features,” in Proc. IEEE ASRU Workshop, St. Thomas, US Virgin Islands, December 2003.
-   H. Hermansky, H. Fujisaki, and Y. Sato, “Analysis and synthesis of speech based on spectral transform linear predictive method,” in Proc. ICASSP, vol. 8, April 1983, pp. 777-780.
-   R. Koenig, H. Dunn, and L. Lacey, “The sound spectrograph,” J. Acoust. Soc. Am., vol. 18, pp. 19-49, 1946.
-   H. Hermansky, “Perceptual linear predictive (PLP) analysis of speech,” J. Acoust. Soc. Am., vol. 87:4, April 1990.
-   M. Athineos, H. Hermansky, and D. Ellis, “PLP²: Autoregressive modeling of auditory-like 2-D spectro-temporal patterns,” Submitted to SAPA-04, Jeju Island, Korea, October 2004.

1. A method of extracting speech features from signals for use in performing automatic speech recognition, the method comprising: receiving a signal; performing a time-to-frequency domain transformation on at least a portion of the received signal to generate a frequency domain representation; performing a frequency domain linear prediction on the frequency domain representation to estimate a temporal envelope of the frequency domain representation; and generating at least one speech feature based at least in part on the temporal envelope.